Abstract


Multiple meta-analyses have now documented small positive effects of teacher professional 
development (PD) on pupil test scores. However, the field lacks any validated explanatory 
account of what differentiates more from less effective in-service training. As a result, 
researchers have little in the way of advice for those tasked with designing or commissioning 
better PD. We set out to remedy this by developing a new theory of effective PD based on 
combinations of causally active components targeted at developing teachers’ insights, goals, 
techniques, and practice. We test two important implications of the theory using a systematic 
review and meta-analysis of 104 randomised controlled trials, finding qualified support for 
our framework. While further research is required to test and refine the theory, we argue that 
it presents an important step forward in being able to offer actionable advice to those
responsible for improving teacher PD.

Introduction

Effective teachers improve pupil achievement, help close the gaps between rich and 
poor pupils, and increase pupil earnings in later life (Chetty et al., 2014; Hamre & Pianta, 2005; 
Slater et al., 2012). Policymakers and educators have therefore invested considerable time and 
money in trying to enhance the skills of the teaching workforce. As a result, teachers now spend 
an average of 10.5 days per year attending courses, workshops, conferences, seminars, 
observation visits, or other types of in-service training (Sellen, 2016). In parallel, governments 
worldwide have invested billions of dollars in research intended to find out how best to design 
this teacher professional development (PD; Boulay et al., 2018; Dawson et al., 2018). 

This investment has resulted in a marked increase in the number of rigorous studies 
quantifying the impact of different approaches to teacher PD (Edovald & Nevill, 2021; Hedges 
& Schauer, 2018). In 2007, a review by Yoon et al. found just nine such studies; in 2016, a
review by Kennedy found 28; and in 2019, Lynch et al. found 95 such studies
focused on science and maths alone. Recent meta-analyses of this literature tend to find average
effect sizes of teacher PD on standardized test scores of around 0.06 (Lynch et al., 2019). On
average, PD has small positive effects on the quality of teaching, as reflected in pupil learning.

While much has been learned from this evaluation literature, fundamental questions 
remain. Most schools do not have access to the PD programmes that have so far been evaluated, 
either because they are not available on the open market, or are too geographically distant, or 
because of capacity constraints on providers. Moreover, meta-analysis suggests considerable 
variation in the impact of PD, depending on how the PD is designed (Basma & Savage, 2017;
Didion et al., 2020; Kennedy, 2016; Kraft et al., 2018; Lynch et al., 2019). Policymakers and
school leaders therefore need to know which characteristics of PD make it effective, so that
they can design or commission the best PD available for their teachers (Hill et al., 2013).

Existing attempts to explain what differentiates more from less effective PD have not
made much progress. One strand of the literature has employed narrative reviews and thematic 
analyses in an attempt to identify what differentiates more and less effective PD (Desimone, 
2009; Timperley et al., 2007; Wei et al., 2009). Indeed, some of the researchers working in this 
tradition have even claimed that the field has reached a consensus on the characteristics of 
effective PD (e.g. Darling-Hammond et al., 2017). However, the narrative reviews on which 
this claim is based have two important methodological weaknesses. First, many of them include 
studies employing non-equivalent control groups. Second, they lack any method for 
differentiating the causally active from causally inactive components of the PD (Sims & 
Fletcher-Wood, 2020). 

A second strand of research has used meta-regression to investigate the associations 
between different aspects of PD design and impact on pupil outcomes (Basma & Savage, 2017;
Didion et al., 2020; Kraft et al., 2018; Lynch et al., 2019). However, there is presently little
consensus on which specific characteristics of PD should be entered into such meta-regression
models, with different papers testing different mediators (e.g., Kraft et al., 2018; Lynch et al.,
2019). Previous research has provided rich ways of conceptualising and categorising PD
(Boylan & Demack, 2018; Kennedy, 2016; Opfer & Pedder, 2011; Sztjan et al., 2011).
However, existing theory offers few testable hypotheses about what makes PD more or less 
effective, which leaves researchers guessing as to how their meta-regression models should be 
specified and thus how the coefficients should be interpreted. 

In sum, we now know about the causal impact of a wide variety of PD programmes,
but have little useful to say about what differentiates more from less effective PD. In
Cummins’ words, “we are overwhelmed with things to explain, and somewhat underwhelmed
by things to explain them with” (Cummins, 2000). In this paper, we set out to remedy this by
proposing and empirically testing a new theory of effective teacher PD. As with all theorising, our
goal is to explain why certain PD designs result in greater impact on teaching and learning. In
doing so, we hope to provide a practical theory (Berkman & Wilson, 2021) that suggests 
actionable steps by which policymakers and school leaders can improve PD design. Our 
research team, which is composed of researchers and teacher educators, reflects this goal. 

In the next section of the paper, we begin by theorising about the four things that PD 
needs to achieve in order to secure improvements in teaching. The subsequent section then 
synthesizes a set of mechanisms for achieving each of these four purposes of PD. We also 
derive some testable implications of this theory. Next, we set out the methods by which we 
conducted a systematic review and meta-analysis in which we code 104 experimentally 
evaluated PD programmes for the presence or absence of each of these mechanisms. The results
section then presents the findings from a number of meta-analytic tests of our hypotheses.

Theorising how PD fails: Insights, Goals, Techniques, Practice


Practical theory building should begin with a review of research providing rich 
descriptions of the target problem (Berkman & Wilson, 2021; Scheel et al., 2021). This 
supports the identification of important concepts, which can then be used as the building blocks 
of a new theory (Hempel, 1966). We take as our central problem the difficulty of designing PD 
that results in sustained improvements in practice (Copur-Gencturk & Papakonstantinou, 2016; 
Hanno, 2021; Hobbiss et al., 2021). In this section, we review a range of descriptive research 
drawing on data from surveys, longitudinal classroom observations, interviews and diary 
studies which, taken together, suggest four important building blocks for our framework. 

Teachers’ knowledge serves as the foundation on which they base decisions about their 
practice. Mixed methods studies illuminate various ways in which teachers’ knowledge 
influences their practice (Carpenter et al., 1989; Franke et al., 2001; Hill et al., 2008) and 
measures of teachers’ knowledge also correlate with estimates of teacher effectiveness (Hill &
Chin, 2018). This suggests that one way in which PD might fail to improve teaching and
learning is by failing to bring about changes in teachers’ knowledge and understanding. This
might happen because the knowledge provided by the PD is inaccurate or irrelevant, or - as 
longitudinal research with teachers has documented - because new learning tends to be 
forgotten over time (Arzi & White, 2008; Liu & Phelps, 2020). The first building block of our 
theory is therefore insight, which we define as teachers gaining an enhanced or expanded 
understanding of teaching and learning. 

Knowledge alone is unlikely to bring about changes in practice (Lord et al., 2017; 
Kennedy, 2016). For example, diary studies have found that teachers report 50% of school- 
based learning experiences result in changes in their knowledge and beliefs, but in only a 
quarter of these cases do these changes in beliefs feed through into changes in their intended 
practice (Bakkenes et al., 2012). A systematic review of studies on formative assessment also 
found that PD is less likely to feed through into intentions to change practice in the absence of 
reinforcement from school leaders (Yan et al., 2021). Hence, PD might also fail to improve 
teaching if it does not motivate teachers to adopt goals around changing their practice. The 
second building block of our theory is therefore goals, which we define as motivating a teacher 
to consciously pursue a specific change in their practice. 

Another point at which PD might fail is around teachers enacting what they have 
learned in the classroom. For example, a three-year study found that early-career science 
teachers espoused strong beliefs in the importance and value of student-centred teaching 
methods but that this was often not reflected in their classroom practice (Simmons et al., 1999). 
Tightly controlled laboratory studies show that knowledge of classroom management 
techniques and formative assessment practices is often insufficient to bring about changes in 
teachers’ practice (Cohen & Wiseman, 2019; Cohen et al., 2020). However, when similar 
teachers are also given feedback on, and practice with, the target skill, this results in
improvements in practice (Cohen et al., 2021). PD can therefore also fail when it neglects to
provide teachers with the necessary skills. Our third building block is therefore developing
technique, which we define as helping a teacher to utilize a new teaching practice. 

Descriptive research also illuminates the difficulties in embedding change. For 
example, Copur-Gencturk & Papakonstantinou (2016) collected detailed observations on a 
group of teachers over a four-year period following a mathematics PD programme. The results 
show how PD can bring about initial changes in practice but this subsequently fades over time. 
Other studies have documented similar patterns of ‘fade-out’. Boston & Smith (2011) report 
case studies illustrating how some teachers who implemented cognitively challenging maths 
instruction immediately after a PD programme no longer did so in a follow-up observation. 
Similarly, Hanno (2021) uses repeated classroom observation to show how some improvements in
practice dissipate quickly. The final building block for our theory is therefore embedding 
practice, which we define as supporting a teacher to consistently make use of some technique 
in the classroom. 

In summary, we propose that PD needs to pay careful attention to four things if it is to 
bring about sustained improvements in teaching practice. First, it needs to provide insight (I)
about teaching and learning. For example, a teacher might learn that working memory is 
composed of separate visual-spatial and phonological systems, each of which has limited 
capacity (Baddeley & Hitch, 1974). Second, PD should motivate teachers to adopt goal- 
directed (G) changes in practice. For example, a teacher might resolve to limit the cognitive 
load their exposition of a subject places on either the visual-spatial or the phonological system 
within working memory. Third, PD should provide techniques (T) for putting these insights to 
work. For example, a teacher might invite pupils to read text from the board in silence, rather 
than also reading out the text, in order to avoid overloading the phonological loop with both
written and aural input. Fourth, PD must embed that change in practice (P). For example, a
teacher might use the ‘read silently from the board’ technique multiple times, across different
classes, until it becomes a routine part of their practice. 

Table 1 summarizes our thinking about how PD can fail if (combinations of) these four 
purposes of PD are not addressed. If PD brings about the necessary changes to I and perhaps 
G, but not to T and P (rows 2 and 3), then this is unlikely to change classroom practice - known
in the teacher education literature as the ‘knowing-doing gap’ (Knight et al., 2013). If PD brings 
about the necessary changes to I, G, and T, but not P, then teachers will tend to revert to 
established routines (row 4). This reflects the extensive literature on the importance of 
automaticity and habits in teachers’ practice (Feldon, 2007; Hobbiss, 2021). Finally, if PD 
brings about the necessary changes to G, T and P, but not I (row 5) then PD has failed to provide 
an understanding of why (and when) a particular practice is effective. This can lead to 
misapplication of a technique in a way that renders it ineffective (Kennedy, 2016; Mokyr, 
2002), sometimes referred to as a ‘lethal mutation’ in the education literature (Brown & 
Campione, 1996, p.259). By contrast, we theorize that when PD succeeds in addressing I, G, T 
and P, it is more likely to be effective. 

Theorising how PD succeeds: mechanisms 

Having theorized the different ways in which PD might fail, we now turn to consider 
how PD might successfully address all four of insights, goals, techniques, and practice. 
Which design features should PD incorporate in order to address all four of these purposes? 
As previously noted, an important challenge here is in differentiating the causally active from 
the causally inactive components of a PD design (Mackie, 1974). After all, associations 
between particular components of PD and the effects of that PD on pupil outcomes could be 
spurious. Yet a practical theory, capable of providing actionable advice for the design of 
better PD, requires the associations to reflect an underlying causal relationship. We refer to
these causally active components of a PD programme as mechanisms, in that they comprise the
entities and activities (causally) responsible for bringing about the effects of that PD on
teaching and learning (Illari & Williamson, 2012, p.14). 

We theorize something to be a causally active component of PD only if we can find 
causal evidence that it helps achieve I, G, T or P from across multiple domains (Sims & 
Fletcher-Wood, 2021). Our reasoning here is simple: if a mechanism x helps to achieve I, G, 
T or P in multiple domains beyond teacher PD and we also observe an association between 
the presence of x in PD programmes and the impact of those PD programmes, then x is likely 
also a causally active component in PD. This type of reasoning - known as analogical 
abduction - is commonly used in developing explanatory theories: “if one finds a similar set 
of phenomena in another field that is better understood, then one can ‘borrow’ explanatory 
principles from that field to inform one’s own” (Borsboom et al., 2021, p.761). In developing 
our list of mechanisms, we draw heavily on empirical findings from cognitive science, 
behavioural science (Michie et al., 2013), and the literature on training medical doctors. We 
searched the literature for mechanisms that a) have sufficient empirical support across 
multiple domains and b) provide an explanatory account of how they affect I, G, T or P. 

With respect to insight (I) - teachers gaining an enhanced understanding of teaching 
and learning - we found two such mechanisms. The first is to manage the cognitive load for 
the teachers taking part in the PD. This can be achieved by focusing on a single idea or task, 
removing redundant information, or by providing worked examples, all of which help to 
prevent working memory from becoming overloaded. For causal evidence that this helps with 
learning new material among school students and adult medical trainees, see the reviews by 
Sweller et al. (2019) and Fraser et al. (2015). The second mechanism is to revisit material,
which can be achieved by reteaching or prompting recall of important ideas on separate
occasions, both of which help to strengthen memory. Causal evidence that this aids
learning in lab settings, as well as in history, maths and language learning at school, can be
found in the reviews by Adesope et al. (2017), Rohrer (2015), and Yang (2021). 

As regards goals (G) - motivating a teacher to pursue a specific change in their 
practice - we found three putative mechanisms, all of which were taken from Michie et al. 
(2013). The first is to go through an explicit goal setting process, in which teachers 
consciously agree on an objective around changing a specific part of their practice. This 
works by directing attention and energy toward the target change (Locke & Latham, 2002). 
Epton et al. (2017) provide a review of evidence that goal setting brings about change in
sporting, health-related and educational settings. The second mechanism is to present
evidence supporting the change from a credible source, by which we mean findings from
empirical research. For reviews of evidence that statistical evidence or justified arguments
help change people’s minds and intentions in settings including health, crime and education,
see O’Keefe (1998) and Hornikx (2005). The third mechanism is reinforcement, which can 
be achieved through praising or restating the value of a certain teaching practice. This has 
been shown to increase motivation in domains including arts, games and maths (Delin & 
Baumeister, 1998). 

With respect to technique - helping a teacher to utilize a new teaching practice - we 
found five mechanisms that met our criteria: instruction, practical social support, modelling, 
feedback, and rehearsal (Michie et al., 2013). Practical social support involves arranging 
advice on how to implement a practice from a teacher’s colleagues. Causal studies show that 
this supports practice change in medical training (Grierson et al., 2012) and in various health 
behaviour settings (Dale et al., 2012; Jolly et al., 2012; Ramchand et al., 2017). Modelling 
involves providing an observable example of the target teaching practice, which provides a 
visual guide for subsequent practice (Renkl, 2014). Many experimental studies in the medical
education literature have found that modelling helps with acquisition of new clinical
(Cordovani & Cordovani, 2016) and surgical skills (Harris et al., 2018). The remaining three
technique mechanisms (instruction, feedback, and rehearsal) are, for space reasons,
discussed in full in Appendix A.

Finally, with respect to embedding practice - supporting a teacher to consistently 
make use of some technique - we found four potential mechanisms that met our criteria 
(Michie et al., 2013). Action planning involves specifying when and how a change in practice 
will be made in a future lesson. This creates situational cues that help trigger new practice 
(Webb & Sheeran, 2008) and has been shown to help change practice in health, education 
and lab settings (Gollwitzer & Sheeran, 2006). Context-specific repetition refers to rehearsing
the target practice in a realistic classroom setting. This helps overwrite existing cue-response 
relationships (habits) by re-associating the classroom setting with the new practice (Hobbiss 
et al., 2021). Experimental studies have shown that rehearsal in realistic simulators for 
surgical trainees (even without feedback from an observer) leads to improved practice on a 
delayed post-test (Andreatta et al., 2006; Van Sickle et al., 2008). Experimental studies have 
also found that interventions focused on overwriting old habits can help embed health 
behaviour change (Carels, 2011). The remaining two practice mechanisms - prompts/cues
and self-monitoring - are discussed in full in Appendix A.

Table 2 summarizes the mechanisms across the four (IGTP) purposes of PD. Three 
clarificatory points are in order. First, while we have searched extensively, and have included 
every mechanism for which we could find sufficient supporting evidence, this list is unlikely 
to be complete. Indeed, even if we have identified every relevant mechanism documented in 
the existing literature, future research may identify additional relevant mechanisms. Second, 
mechanisms within each row of the table can be thought of as substitutes for each other, in that 
they achieve the same thing. However, they are also likely to have a cumulative effect. For
example, incorporating both managing cognitive load and revisiting material in a single PD
programme would likely contribute more to increased insight than incorporating revisiting
material alone. Third, we make no assumptions about the size of the effects of the different
mechanisms. Our argument is only that improvements in e.g. technique are an increasing
function of the number of technique mechanisms incorporated in a given PD programme. This
assumption is formalized in Appendix B.


Hypotheses 
Having set out our theory, we now derive two hypotheses that will be tested in the remaining, 
empirical sections of the paper. Since we theorize that the fourteen mechanisms listed in 
Table 2 are all causally active components of PD with cumulative effects, we hypothesize 
that: 


H1: The number of mechanisms incorporated in PD programmes will be positively
correlated with the impact of those PD programmes on pupil test scores.


In addition, since we theorized that PD is likely to be more effective if it addresses all four
purposes of PD, we hypothesize that:


H2: PD programmes that incorporate at least one mechanism in each of the four
I/G/T/P categories (a ‘balanced design’) will have a larger impact on pupil test scores
than PD programmes that do not.


Our first hypothesis and the definition of a balanced design are formalized in Appendix B. 
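
The two predictors implied by H1 and H2 can be illustrated with a minimal sketch. The mechanism names below follow the text, but the identifiers, the data structure and the function names are assumptions made for illustration only, and the formal definitions remain those in Appendix B.

```python
# Hypothetical coding scheme: the grouping follows the text (2 insight,
# 3 goal, 5 technique and 4 practice mechanisms); all identifiers are
# illustrative assumptions, not the authors' coding frame.
MECHANISMS = {
    "I": ["manage_cognitive_load", "revisit_material"],
    "G": ["goal_setting", "credible_source", "reinforcement"],
    "T": ["instruction", "practical_social_support", "modelling",
          "feedback", "rehearsal"],
    "P": ["action_planning", "context_specific_repetition",
          "prompts_cues", "self_monitoring"],
}


def count_mechanisms(present):
    """Number of the fourteen mechanisms coded as present (the H1 predictor)."""
    known = {name for names in MECHANISMS.values() for name in names}
    return len(set(present) & known)


def is_balanced(present):
    """True if at least one mechanism is present in each of I, G, T and P (H2)."""
    present = set(present)
    return all(present & set(names) for names in MECHANISMS.values())


# A made-up programme with insight, goal and technique mechanisms but
# nothing that embeds practice.
example = ["revisit_material", "goal_setting", "modelling", "feedback"]
print(count_mechanisms(example), is_balanced(example))
```

Under this sketch, a programme only counts as balanced once at least one practice mechanism, such as action planning, is added to the example.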


Methods 
Systematic Review 
We systematically searched the literature to identify primary research studies that 
could be used to test these hypotheses. We included studies in our meta-analysis if they met 
all of the following criteria: 1) they focused on qualified teachers working in formal 
education settings with children 3-18 years of age; 2) they evaluated a teacher PD 
programme, defined as structured, facilitated activity intended to improve their teaching
ability; 3) the evaluation employed a randomized controlled trial (RCT) design, thus allowing
clean causal inference; 4) the control group in the RCT received either business as usual or no
PD; 5) the evaluation measured outcomes using a standardized (not researcher-designed) test
score, to increase the comparability of effect size estimates (Cheung & Slavin,
2016); 6) the study was published during or after 2002; 7) the study was written in English
and conducted in an OECD country.

We employed various combinations of search terms intended to capture three main 
concepts: (1) teachers (e.g. ‘teachers’, ‘educators’); (2) professional development (e.g. ‘in-
service training’, ‘professional learning’); and (3) randomized controlled trials (e.g. ‘RCT’).¹
We used these terms to query eleven different databases and search engines during November
2020.² In addition, we searched the reference lists of eleven previous reviews,³ employed
reference-checking and forward citation searching of included studies,⁴ and browsed eight
websites containing education research repositories.⁵ All records were uploaded into the
EPPI Reviewer software, deduplicated and then screened on title and abstract using
prioritized screening (O’Mara-Eves et al., 2015; Thomas et al., 2011).⁶ All studies included at
this stage were then reviewed in full. This process resulted in 121 eligible experimental 
studies (see the PRISMA flow diagram in Appendix C for further details). 

We extracted Cohen’s d effect sizes for each of the studies in our sample using the 
formulae from Lipsey & Wilson (2001). This was possible for 104 of the 121 studies.
Cohen’s d is known to be slightly biased in small samples, which can be corrected using Hedges’
g (Hedges, 1981). However, Hedges’ g could not be calculated for two of the 104 studies due
to missing data. We therefore present all our results using Cohen’s d, on the basis that losing
studies from the meta-analysis is highly undesirable.⁷ In cases where eligible studies reported
multiple standardized test score outcomes, we selected the primary test score outcome (if
specified), or else collected all standardized test score outcomes. Contour plots, trim-and-fill
and p-curve analysis all suggested either zero or small publication bias.⁸
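
To make the distinction between the two effect size metrics concrete, the sketch below computes Cohen's d from group summary statistics and applies the usual approximate small-sample correction to obtain Hedges' g. It assumes a simple two-group comparison with individual-level data; the function names and example numbers are hypothetical, and the formulae used for any given study in the review (for example, clustered designs) may differ.

```python
import math


def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)


def hedges_g(d, n_t, n_c):
    """Approximate small-sample correction: g = d * (1 - 3 / (4 * df - 1))."""
    df = n_t + n_c - 2
    return d * (1 - 3 / (4 * df - 1))


# Made-up summary statistics for a small trial (20 pupils per arm).
d_small = cohens_d(mean_t=52.0, mean_c=50.0, sd_t=10.0, sd_c=10.0, n_t=20, n_c=20)
g_small = hedges_g(d_small, n_t=20, n_c=20)
print(d_small, g_small)
```

Because the correction factor approaches one as the sample grows, d and g are nearly identical in large trials, which is consistent with the choice to report d rather than drop studies.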

In addition to effect sizes, we coded the studies based on whether the PD incorporated
each of our fourteen mechanisms. In cases where eligible studies reported evaluations of
multiple versions of a PD programme, we focused on the most intensive version.⁹

Two authors (SS and HFW) then double coded 46 papers using this coding frame and
achieved 82% agreement at the mechanism level. The two coders met to discuss
discrepancies until consensus was reached. The coding frame was then revised to further
reduce ambiguity and to support consistent coding.¹⁰ The remaining papers were then
coded for mechanisms by a single author (HFW). In our empirical analysis, we test the 
sensitivity of our main results to the presence of measurement error using errors-in-variables 
regression, using the 82% figure as the best available assumption for the reliability with 
which our mechanisms are measured. 
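
As a rough illustration of the two quantities mentioned above, the sketch below computes simple percent agreement between two coders and applies the textbook errors-in-variables correction for a single noisy predictor, in which the observed slope is divided by the assumed reliability. Both functions and the example codes are hypothetical. The exact estimator used in the analysis is not specified here, so this should be read as the simplest version of the idea rather than the authors' procedure.

```python
def percent_agreement(codes_a, codes_b):
    """Share of (paper, mechanism) cells on which two coders give the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two non-empty code lists of equal length")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


def disattenuate(slope_observed, reliability):
    """Classical correction: random error in the predictor shrinks the slope
    by a factor equal to the reliability, so divide to undo the shrinkage."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must lie in (0, 1]")
    return slope_observed / reliability


# Made-up presence/absence codes from two coders for five cells.
coder_1 = [1, 0, 1, 1, 0]
coder_2 = [1, 0, 0, 1, 0]
agreement = percent_agreement(coder_1, coder_2)

# Illustrative only: an observed slope of 0.01 under an assumed reliability of 0.82.
corrected = disattenuate(0.01, 0.82)
print(agreement, round(corrected, 4))
```

Note that percent agreement does not adjust for chance agreement, which is one reason the 82% figure is described as an assumption about reliability rather than a direct estimate of it.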

Figure 1 provides descriptive statistics about the mechanisms. The left-hand panel
shows the number of mechanisms per PD programme. We observe a minimum of zero 
mechanisms (in just one PD programme) and a maximum of 13 (again in just one PD 
programme). The median number of mechanisms is five and there is a long right tail of 
mechanism-rich programmes. The right-hand panel shows the frequency with which each of 
the 14 mechanisms occurs. All of our mechanisms occur at least once, with prompts/cues
being the least common and instruction being the most common. The techniques (T) 
mechanisms are the most frequently occurring. 

We collected three further types of information from each study. First, we coded for
the geographic location, age group and subject focus for each experiment. Second, we coded
for the ‘broad area of focus’ of the PD, based on whether the content of the PD was largely
based on cognitive science, formative assessment, inquiry learning, or data-driven
instruction.¹¹ Third, we coded for four important indicators of study quality: whether the
experiment was pre-registered; whether the RCT met the What Works Clearinghouse
‘cautious’ standards for acceptable attrition; whether the study randomized more than 50
units to treatment and control; and whether the study employed a high-stakes test score
outcome (Cheung & Slavin, 2016).¹² Table 3 summarizes the characteristics of the 104
studies included in our meta-analytic sample.


Meta-analytic tests 


To calculate meta-analytic average effect sizes, we used robust variance estimation 
(RVE) random effects meta-analysis (Hedges, Tipton & Johnson, 2010; Tanner-Smith &
Tipton, 2013). It was not possible to regress the effect sizes on all 14 mechanisms separately 
due to sample size constraints, further compounded by likely interactions between the various 
mechanisms. To test H1, we therefore plot the impact of all 104 PD programmes on test 
scores (expressed as an effect size) against the number of mechanisms per programme, and 
then add a meta-regression (precision weighted) line of best fit. These plots have the 
advantage of conveying more information about the underlying data than meta-regression 
tables. Since it is not possible to produce these plots using RVE, we use the primary outcome 
(if specified), or else one randomly chosen outcome per PD programme. Where the results of 
the RVE analysis are qualitatively different, we highlight this in the text. We then repeat this 
analysis a number of times, stratifying the data based on the broad content area of the PD and 
various indicators of study quality. One important caveat about these plots is that the 
experimental impact estimates on the y axis all contain random (classical) measurement
error. This artificially increases the variance on the y axis and, by extension, reduces the 
proportion of variance explained by the model. To test H2 we simply plot the interval 
estimates using RVE meta-analysis and all the standardized test score outcomes. 
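
The precision-weighted line of best fit described above can be sketched as a weighted least squares regression of effect size on the number of mechanisms. The version below assumes inverse-variance weights (one over the squared standard error) and a single outcome per programme; the function name and data are hypothetical, and the weights used in the actual analysis may differ, for example if a between-study variance component is included.

```python
def weighted_meta_regression(effect_sizes, std_errors, n_mechanisms):
    """Return (intercept, slope) of a precision-weighted line of best fit."""
    weights = [1.0 / se ** 2 for se in std_errors]  # assumed inverse-variance weights
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, n_mechanisms)) / total
    y_bar = sum(w * y for w, y in zip(weights, effect_sizes)) / total
    s_xy = sum(w * (x - x_bar) * (y - y_bar)
               for w, x, y in zip(weights, n_mechanisms, effect_sizes))
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, n_mechanisms))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Made-up data lying exactly on a line with slope 0.01 and intercept 0.
mechanisms = [0, 3, 5, 8, 13]
effects = [0.01 * m for m in mechanisms]
errors = [0.10, 0.05, 0.04, 0.06, 0.12]
intercept, slope = weighted_meta_regression(effects, errors, mechanisms)
print(round(intercept, 6), round(slope, 6))
```

More precise studies (smaller standard errors) pull the fitted line towards themselves, which is the same logic that sets the circle sizes in the plots.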

The mechanisms incorporated in each PD programme in our sample are not themselves
randomly assigned, meaning that our meta-analysis cannot estimate the causal effects of
those mechanisms. So how can our study provide actionable advice to educators looking to
improve the design of PD? When asked how observational studies can move beyond
correlation to causation, Fisher advised researchers to “make their theories elaborate”. The 
rationale for this is that empirical corroboration of a complicated pattern of predictions helps 
rule out alternative explanations (quoted in Rosenbaum, 2005, p. 8). With respect to H1, our 
theory synthesizes empirical evidence that our mechanisms are causally active in a range of 
other domains, which makes it less plausible that an association within the domain of PD 
programmes is spurious. H2 is also elaborate in the Fisherian sense. Why else would a PD 
programme with at least one mechanism in each of the I/G/T/P categories be more effective 
than e.g. a programme with at least one mechanism in three but not four of the categories? 
We also pre-registered both of our hypotheses prior to data collection, making the subsequent 
empirical analysis a genuinely risky test of our theory (Mayo, 2018).¹³

Results

Our first test of H1 can be found in Figure 2, which plots the number of mechanisms 
against the impact estimate (scaled as an effect size) for all 104 PD programmes in our 
sample. Larger circles represent studies with more precise estimates, with the size being
proportional to the weight they are given in the analysis. The meta-regression line of best fit
is upward sloping (β = 0.01, p = .02). PD interventions incorporating zero mechanisms have
an expected effect size close to zero and PD interventions incorporating 13 mechanisms have
an expected effect size close to .15. For context, the average effect size in our sample is 0.05
(p < 0.001), implying the number of mechanisms incorporated in PD can account for
variation equivalent to three times the average effect.
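
The arithmetic behind this comparison can be checked directly from the figures quoted in the text. The sketch below treats "close to zero" and "close to .15" as exact end points, which is an assumption made purely for illustration; the variable names are likewise hypothetical.

```python
# Figures quoted in the text, treated as exact for illustration only.
effect_at_zero = 0.0        # expected effect with zero mechanisms
effect_at_thirteen = 0.15   # expected effect with 13 mechanisms
average_effect = 0.05       # average effect size in the sample

# Slope implied by the two end points of the fitted line.
implied_slope = (effect_at_thirteen - effect_at_zero) / 13

# Range of predicted effects relative to the average effect.
ratio_to_average = (effect_at_thirteen - effect_at_zero) / average_effect

print(round(implied_slope, 2), round(ratio_to_average, 1))
```

The implied slope of roughly 0.0115 per mechanism is consistent with the reported gradient once rounded to two decimal places.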


Figure 2 suggests a considerable degree of unexplained variation. Figure 3 therefore
stratifies the analysis based on the broad content area of the PD. The proportion of variance
explained more than doubles, from 16% to 36%, although both of these will be underestimates
due to classical measurement error on the y axis. The gradient for formative assessment
increases to .02 but is no longer statistically significant at conventional levels (p = .09).
The gradient for inquiry also increases to 0.03 (p = .046). However, the relationship among PD
programmes focused on data-driven instruction breaks down entirely (β = -0.01, p = .64),
albeit in a sample of just seven studies. For context, the average impacts of PD focused on
formative assessment (d = .04, p = .08) and on data-driven instruction (d = .04, p = .08) are
not significantly different from zero in general. By contrast, the average impact of PD focused
on inquiry is positive (d = .07, p = .01).
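The point that the proportion of variance explained is understated under classical measurement error on the y axis can be made explicit. As a sketch in notation introduced here for illustration (not the paper's own), let the observed effect size be $y = y^* + e$, where $y^*$ is the true effect and $e$ is error uncorrelated with both $y^*$ and the number of mechanisms. The error leaves the explained variance unchanged but inflates the total variance, so:

```latex
\[
R^2_{\text{obs}}
  = \frac{\operatorname{Var}(\hat{y})}{\operatorname{Var}(y^*) + \operatorname{Var}(e)}
  = R^2_{\text{true}} \cdot
    \frac{\operatorname{Var}(y^*)}{\operatorname{Var}(y^*) + \operatorname{Var}(e)}
  \;\le\; R^2_{\text{true}}
\]
```

Here $\operatorname{Var}(\hat{y})$ is the variance of the fitted values. The ratio on the right is the reliability of the observed effect sizes, which is below one whenever $\operatorname{Var}(e) > 0$.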


Our final analysis relating to H1 tests the sensitivity of the hypothesized
relationship to various indicators of study quality and treatment heterogeneity (Figure 4). We
find a similar relationship among studies using high-stakes test scores (Panel 1, β = .01,
p = .04), among large trials (Panel 3, β = .01, p = .04) and among PD programmes that do not
include sets of new curriculum materials (Panel 5, β = .01, p = .03). We also find a very
similar gradient in our errors-in-variables regression (Panel 6, β = .01, p = .12) and in studies
with low attrition (Panel 2, β = .01, p = .06). These last two results are not statistically
significant at conventional levels; however, the result for attrition is significant when
estimated using RVE (β = .02, p = .02), which suggests it is marginal.

The most concerning part of Figure 4 is the panel for pre-registered studies, in which
both the gradient and p value break down (Panel 4, β = .004, p = .32). In principle, the
absence of a relationship among pre-registered studies could be explained by: p-hacking in
trials that are not pre-registered; otherwise higher methodological standards in pre-registered
trials; or inferior selection of PD programmes by the types of funders that require
pre-registration. We probe these potential explanations further in Appendix F. Our p-curve
analysis does not indicate motivated p-hacking in our sample (Simonsohn et al., 2014a;
Simonsohn et al., 2014b). Our comparison of methodological standards shows that
pre-registered trials are indeed more likely to use high-stakes test score outcomes and have
lower attrition (indicators of higher methodological standards). We also find some evidence
that the PD programmes evaluated in pre-registered trials are slightly less well designed, as
indicated by the number of mechanisms that they incorporate.

We now turn to our empirical tests of H2. Figure 5 shows the average impact of five
separate groups of PD. On the left are PD programmes that incorporate mechanisms
addressing all four I/G/T/P purposes of PD (‘balanced designs’). To the right of that are PD
programmes that incorporate mechanism(s) addressing three or fewer purposes of PD
(‘imbalanced designs’). PD with an imbalanced design has an average impact of .05,
regardless of whether it addresses 1, 2, or 3 purposes of PD. By contrast, PD with a balanced
design has an average impact of .15 (p = .03). However, the 95% confidence interval for
balanced PD programmes is wide and overlaps with the confidence interval for all
imbalanced designs (p = .22). There are two reasons that the confidence interval is much
wider on the balanced PD plot. First, there are fewer studies in the balanced (n = 9) versus
imbalanced plots (n = 95). Second, there is more heterogeneity, captured by the standard
deviation of the effect sizes (τ), among the effect sizes in the balanced (τ = .1) versus
imbalanced plots (τ = .05). We return to this point in the discussion section.
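The two reasons given for the wider confidence interval can be illustrated with a minimal sketch of the standard error of a random-effects pooled mean. It assumes, purely for illustration, that every study has the same sampling variance; the study counts and τ values are those reported above, while the function name and the sampling variance are hypothetical. It does not reproduce the robust variance estimation used in the analysis.

```python
import math


def pooled_standard_error(sampling_variances, tau):
    """Standard error of a random-effects pooled mean.

    Each study is weighted by 1 / (v_i + tau^2), where v_i is its sampling
    variance and tau is the between-study standard deviation.
    """
    weights = [1.0 / (v + tau ** 2) for v in sampling_variances]
    return math.sqrt(1.0 / sum(weights))


# Hypothetical assumption: every study has a standard error of 0.05.
v = 0.05 ** 2

# Study counts and tau values as reported in the text.
se_balanced = pooled_standard_error([v] * 9, tau=0.10)
se_imbalanced = pooled_standard_error([v] * 95, tau=0.05)

print(f"balanced half-width:   {1.96 * se_balanced:.3f}")
print(f"imbalanced half-width: {1.96 * se_imbalanced:.3f}")
```

Both fewer studies and a larger τ shrink the sum of the weights, so each widens the interval independently of the other.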


Our theoretical framework encodes a set of assumptions about which mechanisms
address which of the four purposes of PD. While we pre-registered the overall I/G/T/P
framework, and we believe the match between mechanisms and purposes to be well-grounded
in theory, we did not pre-register our list of mechanisms. Figure 6 therefore checks the
sensitivity of our findings to reallocating mechanisms in four cases where our assumptions
might be arguable. First, we reallocated the feedback mechanism to the insight (I) purpose.
Second, we reallocated the credible source mechanism to the insight (I) purpose. Third, we
reallocated the praise/reinforce mechanism to the embed practice (P) purpose. Fourth, we
reallocated the context-specific repetition mechanism to the techniques (T) purpose. The
results are qualitatively similar to those in Figure 5, suggesting that our findings are not
particularly sensitive to these assumptions.
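The reallocation check can be sketched in code. The mapping below follows the mechanism numbering in Table 2 and the four reallocations described above; the function names and the example programme are hypothetical, and this is an illustration of the classification logic rather than the analysis code used for Figure 6.

```python
# Mechanism-to-purpose mapping as listed in Table 2 (mechanisms 1-14).
DEFAULT_PURPOSE = {
    1: "I", 2: "I",
    3: "G", 4: "G", 5: "G",
    6: "T", 7: "T", 8: "T", 9: "T", 10: "T",
    11: "P", 12: "P", 13: "P", 14: "P",
}

# The four reallocations described in the text, as {mechanism: new purpose}.
REALLOCATIONS = {
    "feedback -> I": {9: "I"},
    "credible source -> I": {4: "I"},
    "praise/reinforce -> P": {5: "P"},
    "context-specific repetition -> T": {14: "T"},
}


def purposes_addressed(mechanisms, mapping=DEFAULT_PURPOSE):
    """Return the set of I/G/T/P purposes covered by a set of mechanisms."""
    return {mapping[m] for m in mechanisms}


def is_balanced(mechanisms, mapping=DEFAULT_PURPOSE):
    """A design is balanced if it covers all four purposes."""
    return purposes_addressed(mechanisms, mapping) == {"I", "G", "T", "P"}


# Hypothetical programme: goal setting, modelling, feedback, action planning.
programme = {3, 8, 9, 12}
print("default mapping:", is_balanced(programme))
for label, change in REALLOCATIONS.items():
    mapping = {**DEFAULT_PURPOSE, **change}
    print(f"{label}:", is_balanced(programme, mapping))
```

In this hypothetical case the programme is imbalanced under the default mapping but becomes balanced if feedback is treated as an insight mechanism, which is the kind of reclassification the sensitivity analysis guards against.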
Discussion 

We set out to develop and test a practically useful theory of effective teacher PD. To 
do so, we developed an account of four ways in which PD might fail to bring about sustained 
improvements in teaching practice. Against this, we synthesized a set of mechanisms 
hypothesized to be causally active in addressing each of these four purposes of PD. How 
successful has our research been in achieving this goal? Various frameworks have been 
suggested for evaluating theories (Kuhn, 1977; Gawronski & Bodenhausen, 2015; Van 
Lange, 2015). While there are differences in emphasis and language, all of these frameworks 
emphasize the importance of: abstraction/parsimony; plausibility/coherence; explanatory 
power; usefulness/applicability; and progress/fruitfulness. In the remainder of the paper, we 
assess the strengths and limitations of our theoretical framework against these criteria.

Parsimony requires that theories abstract away from empirical detail, using the fewest
assumptions or components necessary to explain the target phenomenon (Gawronski &
Bodenhausen, 2015; Eronen & Bringmann, 2021). Our top-level framework involves just
four components: insights, goals, techniques, practice. As set out in Table 1, this simple
set-up allows us to account for a range of phenomena documented in the PD literature,
including the knowing/doing gap, the importance of habits, and lethal mutations. Within each
of the I/G/T/P categories, our framework includes between two and five mechanisms, which
collectively allowed us to characterize and capture considerable variation in our set of 104
experimentally evaluated PD interventions. We found some instances of all fourteen
mechanisms and one intervention containing 13 of these 14 mechanisms, suggesting the
framework is not overly elaborate relative to current PD design. We were also able to
formalize several aspects of our framework (Appendix B).

The coherence and plausibility of a theory depend on its degree of fit with existing
knowledge or other, well-corroborated, theory (Scheel et al., 2021). An important and
distinctive feature of our framework is the requirement that each mechanism be supported by
empirical causal evidence from multiple domains. Our theory is therefore closely integrated
with empirical findings from the health psychology, cognitive science and medical education
literatures. It also builds on cognate theoretical frameworks (e.g. Michie et al., 2013). While
we were careful to only include mechanisms for which we found sufficient supporting
empirical evidence, we acknowledge that the strength of the evidence varies across our
mechanisms. Some mechanisms have strong, direct supporting evidence from very many
settings (e.g. goal setting), while others have evidence from fewer domains (e.g. rehearsal) or
from fewer good studies (e.g. credible source). In future, mechanisms should be added if basic
research discovers sufficient evidence for them, or removed if new research
brings into doubt the evidence on which they are currently included.

A theory has explanatory power if it can provide an accurate account of how and why
something occurred by citing earlier events (Cummins, 2000; Elster, 2015). Our theory
achieves this in two senses. In a qualitative sense, it provides an account of how PD succeeds
or fails to improve teaching practice via changes in I/G/T/P, brought about by the
mechanisms incorporated in the PD. We were careful to provide such an account both for
how different combinations of I/G/T/P affect teaching practice and for how each individual
mechanism affects I/G/T/P. In a quantitative or statistical sense, our meta-analysis showed
that the number and combination of mechanisms incorporated in the PD can explain variation
in effects between 0 and 0.15 standard deviations, a range equivalent to three times the
average impact of PD. Crucially, we argue that the persuasiveness of our account derives
from the combination of these two types of evidence: independent evidence that each
of our mechanisms is causally active in various other domains, plus evidence of an
association between PD incorporating those mechanisms and the impact on test scores.

Having said that, we acknowledge that our empirical findings come with two
important caveats. The first relates to the wide confidence intervals on the estimate for
balanced designs. This is likely largely due to the smaller number of evaluations (n = 9) for PD
with a balanced design. Further experimental evaluations of PD are therefore needed in order
to provide a more precise test of this hypothesis. The second caveat relates to the absence of a
statistically significant relationship between the number of mechanisms and the impact of the
PD among the subset of pre-registered studies. Our additional analysis in Appendix F
suggests that this likely reflects greater use of high-stakes test scores, lower attrition, and
slightly weaker PD designs among pre-registered evaluations, but not p-hacking. Assuming
this is correct, we would expect this to reduce the gradient we observe by shrinking variation
on the y axis. This is broadly consistent with our finding that PD evaluations that are
pre-registered (d = .01) have lower effect sizes than PD evaluations in general (d = .05;
Appendix E) and with similar findings from the broader education literature (Kraft, 2020).14
Our theory could be further stress-tested here by conducting pre-registered A/B tests in which
the same PD content is delivered using low-mechanism and high-mechanism designs.

For a theory to be practical it should point towards actionable steps for solving a real- 
world problem (Berkman & Wilson, 2021). The weight of the evidence presented here 
suggests that PD incorporating more mechanisms should be favoured over PD incorporating 
fewer mechanisms on the grounds that it is more likely to be effective, other things equal. 
Likewise, it seems hard to explain the pattern of results found in Figure 5, other than by the 
importance of ensuring that PD addresses all four of insights, goals, techniques and practice. 
However, the imprecision of this finding means this latter recommendation should be kept
under close review as further evaluations of PD using a balanced design are conducted.

The theory may also be useful for researchers in helping to express with clarity what
is involved in the PD programmes that they evaluate. An important limitation of our analysis
is that we did not achieve agreement on 18% of the mechanism codes for the studies in our
sample. Looking beyond our study, this is clearly problematic, since attempts to scale up or
imitate successful interventions require clarity on the causally active components of the PD
programme. While tools have been developed to aid more precise descriptions of
interventions in the medical literature (Hoffman et al., 2014), our review highlights that the
education literature still has some way to go in this respect. Using our framework to describe
the design of PD in evaluation reports would be a step forward in helping researchers
increase the precision with which they report the likely causally active components, thus
reducing ambiguity where ambiguity matters most.

For a theory to be fruitful (our final evaluative criterion), it should suggest avenues
and hypotheses for future research (Ivani, 2018; Van Lange, 2015). We have tested two such
hypotheses here and these should of course be tested further as new experimental evaluations
using standardized test scores are published. As an auxiliary hypothesis, we suggest that
researchers aim to achieve at least 82% item-level agreement when coding new PD
evaluations prior to using these data in future tests. We look exclusively at test score
outcomes in this analysis; however, our theory also makes a number of predictions about
intermediate outcomes (see Appendix B). Future tests could therefore also address whether
the number of mechanisms addressing e.g. Insight or Techniques predicts measured changes
in teacher knowledge or practice.
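The 82% item-level agreement benchmark can be checked with a short calculation. The sketch below assumes that an "item" is one study-by-mechanism code (present or absent) and that agreement is the simple share of items on which two coders give the same code; the function name, data structure and example codes are hypothetical.

```python
def item_level_agreement(codes_a, codes_b, n_mechanisms=14):
    """Share of (study, mechanism) items on which two coders agree.

    Each argument maps a study ID to the set of mechanism numbers
    (1..n_mechanisms) that a coder judged to be present in that study.
    """
    if codes_a.keys() != codes_b.keys():
        raise ValueError("Both coders must code the same studies")
    agreements = 0
    for study in codes_a:
        for m in range(1, n_mechanisms + 1):
            # Agreement means both coders mark the mechanism as present,
            # or both mark it as absent.
            if (m in codes_a[study]) == (m in codes_b[study]):
                agreements += 1
    return agreements / (n_mechanisms * len(codes_a))


# Hypothetical codes for two studies from two independent coders.
coder_a = {"study_1": {1, 3, 9}, "study_2": {6, 8}}
coder_b = {"study_1": {1, 3}, "study_2": {6, 8, 12}}

agreement = item_level_agreement(coder_a, coder_b)
print(f"agreement = {agreement:.1%}, meets 82% benchmark: {agreement >= 0.82}")
```

Raw percentage agreement does not correct for agreement expected by chance, so a chance-corrected statistic could be reported alongside it.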

In conclusion, we submit that the I/G/T/P theory represents an important advance over
existing theories of effective PD and, notwithstanding some important caveats in our
empirical results, provides what is now the best-corroborated, genuinely explanatory account
of what differentiates more and less effective teacher PD (Haig, 2009).

Tables

Table 1
Summary of how PD can fail to bring about sustained improvements in teaching and learning

Instil        Motivate    Develop          Embed
Insight (I)   Goals (G)   Techniques (T)   Practice (P)   Consequences
✓ ✓ Knowing-doing gap
✓ Knowing-doing gap
✓ ✓ ✓ Revert to established habits
✓ ✓ ✓ Misapplication
✓ ✓ ✓ ✓ More likely to be effective

Table 2
Combining the mechanisms and IGTP

Purpose                 Mechanism
Instil insight (I)      1. Manage cognitive load
                        2. Revisit prior learning
Motivate goals (G)      3. Goal setting
                        4. Credible source
                        5. Praise/reinforce
Teach techniques (T)    6. Instruction
                        7. Practical social support
                        8. Modelling
                        9. Feedback
                        10. Rehearsal
Embed practice (P)      11. Prompts/cues
                        12. Action planning
                        13. Self-monitoring
                        14. Context-specific repetition

Table 3
Descriptive statistics for the meta-analytic sample 
Characteristics Count Proportion 
Location 
USA 73 70.2% 
UK 25 24.0% 
Other 6 5.8% 
Age group 
Early years/Pre-kindergarten 29 27.9% 
Primary/Elementary 52 50.0% 
Middle/Secondary/High 28 26.9% 
Subject targeted 
Literacy/first language 52 50.0% 
Maths 30 28.9% 
Science 12 11.5% 
Other subjects 6 5.8% 
Cross-curricular 17 16.4% 
Broad area of focus 
Cognitive science 1 1.0% 
Inquiry 16 15.4% 
Formative assessment 14 13.5% 
Data-driven instruction 7 6.7%
Pre-registered 
Yes 26 25.0% 
No 78 75.0% 
What Works Clearinghouse Attrition 
Acceptable 36 34.6% 
Unacceptable/Unclear 68 65.4% 
Number of units randomized 
>50 64 61.5% 
≤50 40 38.5%
Test type 
High-stakes standardized 29 27.9% 
Low-stakes standardized 75 72.1% 
Total: 104 100% 


Note. Percentages may not sum to 100 within cells, and counts may not sum to 
104 within cells, due to rounding or due to sub-categories not being exhaustive. 

FIGURE 1. Mechanisms descriptive statistics. N = 104 PD programmes

FIGURE 2. Relationship between the number of mechanisms in a PD programme and impact on pupil
test scores

Note. n = 104 studies. Uses the primary outcome as specified in the study or else one randomly selected outcome per study.
Effect sizes >.5 or <-0.2 are used in the underlying meta-regression but are not shown in the figure to aid visual clarity.

FIGURE 3. Relationship between the number of mechanisms in a PD programme and impact on pupil
test scores, by content area of the PD

Note. N = number of separate experimental studies. Uses the primary outcome as specified in the study or else one randomly
selected outcome per study. Effect sizes >.5 or <-0.2 are used in the underlying meta-regression but are not shown in the
figure to aid visual clarity.

FIGURE 4. Relationship between the number of mechanisms in a PD programme and impact on pupil
test scores, by indicators of study quality 


Note. N = number of separate experimental studies. ‘Large trials’ involve more than 50 units randomized to treatment or 
control. Uses the primary outcome as specified in the study or else one randomly selected outcome per study. Effect sizes > 
.5 or < -.2 are used in the underlying meta-regression but not shown in the chart in order to aid visual clarity. 

FIGURE 5. Average impact of PD on test scores, by how many of the four ‘purposes of PD’ are
addressed by the PD 


Note. k = number of effect sizes. n = number of separate experimental studies. Random effects meta-analysis, incorporating 
all standardized test score outcomes using robust variance estimation. Vertical lines represent 95% confidence intervals. 

FIGURE 6. Sensitivity analysis for meta-analytic average impact of PD on test scores, by how many
of the four ‘purposes of PD’ are addressed by the PD design 


Note. k = number of effect sizes. n = number of separate experimental studies. Random effects meta-analysis, incorporating 
all standardized test score outcomes using robust variance estimation. Vertical lines represent 95% confidence intervals. 

Appendix A: Description of additional mechanisms


The three techniques mechanisms not discussed in the body of the text are instruction,
feedback, and rehearsal. Instruction is the provision of directive advice on how to implement
some practice. Instruction works by eliminating ambiguity about what is required to
successfully use a procedure and has been shown to be beneficial in science education and
medical training contexts (Kirschner et al., 2006; Sweller et al., 2019). Feedback is the
provision of evaluative guidance based on prior observation of the target practice. It works by
identifying and then advising on areas for improvement and has been shown to improve
learning among pupils and motor-cognitive skills among dental and medical trainees
(Al-Saud et al., 2017; Hatala et al., 2014; Ivers et al., 2012; Van der Kleij et al., 2015). Finally,
rehearsal refers to structured practice outside of a real classroom setting. This improves
accuracy and speed of future performance. There is considerable correlational evidence for
the importance of rehearsal across various domains (Macnamara et al., 2016) with causal
evidence from medical education (McGaghie et al., 2011).

The two embed practice mechanisms not discussed in the body of the text are
prompts/cues and self-monitoring. Prompts/cues involves introducing environmental stimuli
with the purpose of prompting the desired practice. Prompts/cues have been shown to trigger
increased goal-directed behaviour in experimental research on gym attendance (Calzolari &
Nardotto, 2017), changing doctors’ clinical practice (Shojania et al., 2010), and in increasing
appointment attendance by patients (Guy et al., 2012). Finally, self-monitoring involves
establishing a method for somebody to record and then review their own practice. Causal
research shows that self-monitoring helps to embed health behaviour changes around weight
loss, sleep hygiene and physical activity (Burke et al., 2011; Compernolle et al., 2019; Todd
& Mullan, 2014).

Appendix B: Formalising the theory and hypotheses

Below is a formal statement of the hypothesized relationship between the fourteen mechanisms
$x_1, x_2, \ldots, x_{14}$ and the four I/G/T/P purposes of PD. The subscripts on the $x$'s
correspond to the numbers in Table 2.

\[
\begin{aligned}
I &= f\Big(\sum_{i=1}^{2} x_i\Big),   & f'(\cdot) &> 1 \\
G &= f\Big(\sum_{i=3}^{5} x_i\Big),   & f'(\cdot) &> 1 \\
T &= f\Big(\sum_{i=6}^{10} x_i\Big),  & f'(\cdot) &> 1 \\
P &= f\Big(\sum_{i=11}^{14} x_i\Big), & f'(\cdot) &> 1
\end{aligned}
\]

Formal statement of Hypothesis 1 (H1):

\[
\mathit{TestScores} = f\Big(\sum_{i=1}^{14} x_i\Big), \qquad f'(x) > 1
\]

Formal definition of a balanced design:

A ‘balanced’ PD design satisfies the following:

\[
(x_1 \lor x_2) \land (x_3 \lor x_4 \lor x_5) \land
(x_6 \lor x_7 \lor x_8 \lor x_9 \lor x_{10}) \land
(x_{11} \lor x_{12} \lor x_{13} \lor x_{14})
\]

Appendix C: PRISMA


FIGURE A1. PRISMA flow diagram 

Appendix E: Average impacts by indicators of study quality

             Full       Low       High      >50       <51       Pre-reg.   Not
             sample     attrit.   attrit.   units     units                pre-reg.
Estimate     0.05**     0.018*    0.082**   0.036**   0.096**   0.005      0.074**
Std. Error   (0.009)    (0.007)   (0.014)   (0.009)   (0.018)   (0.007)    (0.012)
k[n]         205[104]   49[36]    156[68]   106[65]   104[39]   32[26]     173[78]
Difference   NA         p=0.006             p=0.008             p=0.0001

Notes: Low/High attrit. (attrition) is based on the What Works Clearinghouse ‘cautious’ standards for acceptable
attrition at both the cluster and pupil level. >50 units means that the trial randomized more than 50 units to treatment
and control. Pre-reg. = the trial was pre-registered before it was conducted. Numbers in round parentheses are
standard errors. k is number of effect sizes and n is number of experimental studies. **p<0.01. *p<0.05. Calculated
using random effects robust variance estimation meta-analysis.

Appendix F: Probing explanations for difference among pre-registered studies

Note: The observed p-curve includes 19 statistically significant (p < .05) results, of which 15 are p < .025.
There were 85 additional results entered but excluded from p-curve because they were p > .05.
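
The counts in this note are enough to illustrate one simple p-curve check, the binomial test for right skew. If there were no true effects, significant p-values would be uniformly distributed, so about half would fall below .025. The sketch below computes the one-sided probability of seeing at least 15 of 19 below .025 under that null. The function name is hypothetical, and the p-curve analysis reported here may rest on additional tests beyond this one.

```python
from math import comb


def pcurve_binomial_p(n_significant, n_below_025):
    """One-sided binomial p-value for right skew in a p-curve.

    Under the null of no true effect, each significant p-value has a 0.5
    chance of falling below .025. Returns P(X >= n_below_025).
    """
    tail = sum(comb(n_significant, k)
               for k in range(n_below_025, n_significant + 1))
    return tail / 2 ** n_significant


# Counts taken from the note above: 19 significant results, 15 with p < .025.
p_value = pcurve_binomial_p(19, 15)
print(f"binomial test for right skew: p = {p_value:.4f}")
```

A small value here indicates more very low p-values than a flat curve would produce, which is the pattern expected when the significant results reflect real effects rather than selective reporting.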

              Indicators of higher methods standards       Indicators of less effective PD
              High-stakes   ‘Acceptable’   No. of units    PD +          No. of       I, G, T & P
              test score    attrition      randomized      curric/tech   mechanisms   mechanisms
Pre-reg       34.6%         65.4%          149             42.3%         4.2          7.7%
Not pre-reg   25.6%         24.4%          67.9            44.9%         5.2          8.9%

Notes: ‘Acceptable’ attrition is defined in line with the What Works Clearinghouse standards. ‘PD + curric/tech’ implies
the PD programme also had a curriculum reform or educational technology element. ‘I, G, T & P mechanisms’ implies
that a PD programme has at least one mechanism in each of the Insight, Goals, Technique and (embed) Practice
categories.


1 For an example database search see Appendix 2 of Sims et al. (2021). Further details about search terms are
available on request from the authors.

2 Australian Education Index (Proquest); British Education Index (BEI); EconLit (EBSCO); Education
Resources Information Center (ERIC) (EBSCO); Education Abstracts (EBSCO); Educational Administration 
Abstracts (EBSCO); EPPI-Centre database of education research; ProQuest Dissertations & Theses; PsycINFO 
(OVID); Teacher Reference Center (EBSCO); Google Scholar. 

3 Cordingley et al., 2015; Desimone, 2009; Dunst et al., 2015; Kennedy, 2016; Kraft et al., 2018; Lynch et al., 
2019; Rogers et al., 2020; Timperley et al., 2007; Walter & Briggs, 2012; Wei et al., 2009; Yoon et al., 2007. 

4 Forward citation searching was done for all included studies that were available in Microsoft Academic. 

5 Center for Coordinated Education MRDC publications; CUREE (Centre for the Use of Evidence and Research
in Education); Digital Education Resource Archive; Education Endowment Foundation (EEF); EIPEE search
portal; EPPI-Centre database of education research; Institute of Education Sciences What Works Clearinghouse;
Nuffield Foundation.

6 See pages 80-81 of Sims et al. (2012) for further details.

7 In Sims et al. (2021), we show that the main results are no different when we use Hedges’ g effect sizes
among the 102 studies for which they were available.

8 See appendix 7 in Sims et al. (2021) for all three analyses. 

9 By most intensive, we mean that the other versions of the intervention include (1) some but not all of the same
components, and (2) no additional components. Where it was not possible to clearly distinguish more and less 
intensive versions, we picked a version at random. 

10 See Appendix 5 in Sims et al. (2021) for the full coding frame.

11 Cognitive science is PD focused on the use of findings from cognitive science relating to how memory works
and how humans learn. Formative assessment is PD focused on how to elicit evidence of pupil understanding
and then use this evidence to adapt the next steps in instruction. Inquiry is PD focused on pedagogy that
encourages students to construct knowledge for themselves via solving problems and completing authentic
tasks, working with autonomy. Data-driven instruction is PD using cyclical class-wide testing to systematically
collect data on pupil progress and then refocusing or differentiating instruction based on the findings. For more
on this, see page 16 of Sims et al. (2021).

12 We define a test as being high stakes if its administration is a legal requirement by any level of government.

13 https://d2tic4wvo liusb.cloudfront.net/documents/guidance/EEF. Systematic_Review_of Professional Development. Dr_Sam_Sims._Protocol.pdf

14 See the ‘DoE’ column of Table 1 in Kraft (2020), which finds that government-funded trials (which are often
pre-registered) have effect sizes around one third to one half the size of effects more generally.